NSF PAR Search | NSF Public Access Repository

Health diagnosis associated with COVID-19 death in the United States: A retrospective cohort study using electronic health records

Joseph, Mariam; Li, Qiwei; Shin, Sunyoung (March 2025, PloS one)

Background The United States has experienced a high surge in COVID-19 cases since the beginning of 2020. Identifying the diagnoses associated with a higher risk of COVID-19 death will help identify the most vulnerable populations and enable these populations to take better precautionary measures. Objective To identify demographic factors and health diagnosis codes associated with a high or low risk of COVID-19 death, using individual health record data from the United States. Methods We used logistic regression models to analyze the top 500 health diagnosis codes and demographics that have been identified as being associated with COVID-19 death. Results Among 223,286 patients who tested positive at least once, 218,831 (98%) were alive and 4,455 (2%) died during the study period. In our logistic regression analysis, four demographic characteristics of patients (age, gender, race, and region) were found to be associated with COVID-19 mortality. Patients from the West region of the United States (Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, Nevada, New Mexico, Oregon, Utah, Washington, and Wyoming) had the highest odds ratio of COVID-19 mortality across the United States.
In terms of diagnoses, Complications mainly related to pregnancy (Adjusted Odds Ratio, OR: 2.95; 95% Confidence Interval, CI: 1.4 – 6.23) had the highest odds ratio for COVID-19 death, followed by Other diseases of the respiratory system (OR: 2.0; CI: 1.84 – 2.18), Renal failure (OR: 1.76; CI: 1.61 – 1.93), Influenza and pneumonia (OR: 1.53; CI: 1.41 – 1.67), Other bacterial diseases (OR: 1.45; CI: 1.31 – 1.61), Coagulation defects, purpura and other hemorrhagic conditions (OR: 1.37; CI: 1.22 – 1.54), Injuries to the head (OR: 1.27; CI: 1.1 – 1.46), Mood [affective] disorders (OR: 1.24; CI: 1.12 – 1.36), Aplastic and other anemias (OR: 1.22; CI: 1.12 – 1.34), Chronic obstructive pulmonary disease and allied conditions (OR: 1.18; CI: 1.06 – 1.32), Other forms of heart disease (OR: 1.18; CI: 1.09 – 1.28), Infections of the skin and subcutaneous tissue (OR: 1.15; CI: 1.04 – 1.27), Diabetes mellitus (OR: 1.14; CI: 1.03 – 1.26), and Other diseases of the urinary system (OR: 1.12; CI: 1.03 – 1.21). Conclusion We found demographic factors and medical conditions, including some novel ones, that are associated with COVID-19 death. These findings can be used for clinical and public awareness and for future research.
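The odds ratios and intervals reported in this abstract are the kind of quantities obtained by exponentiating logistic regression coefficients. The following is a minimal sketch of that conversion, assuming Wald-type 95% intervals on the log-odds scale; the abstract does not state which interval method the authors used, and the function names are hypothetical. It backs out an approximate coefficient and standard error from the reported Renal failure estimate and converts them back.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Turn a log-odds coefficient and its standard error into (OR, lower, upper).

    Assumes a Wald interval: exp(beta +/- z * se). This is an illustrative
    assumption, not necessarily the method used in the study.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coefficient_from_odds_ratio(odds_ratio, lower, upper, z=1.96):
    """Approximate the coefficient and standard error behind a reported OR and CI."""
    beta = math.log(odds_ratio)
    # On the log scale a Wald interval has total width 2 * z * se.
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# Reported values for Renal failure: OR 1.76, CI 1.61 to 1.93.
beta, se = coefficient_from_odds_ratio(1.76, 1.61, 1.93)
print(odds_ratio_with_ci(beta, se))
```

Because the published bounds are rounded, the round trip recovers them only approximately.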
Free, publicly-accessible full text available March 31, 2026
Scalable test of statistical significance for protein-DNA binding changes with insertion and deletion of bases in the genome

https://doi.org/10.1214/24-AOAS1950

Zhou, Qinyi; Zuo, Chandler; Zhang, Yuannyu; Chen, Min; Xu, Jian; Shin, Sunyoung (December 2024, The Annals of Applied Statistics)

Full Text Available
A Generalized Formulation for Group Selection via ADMM

https://doi.org/10.1007/s10915-024-02571-9

Ke, Chengyu; Shin, Sunyoung; Lou, Yifei; Ahn, Miju (May 2024, Journal of Scientific Computing)

Abstract This paper studies a statistical learning model where the model coefficients have a pre-determined non-overlapping group sparsity structure. We consider a combination of a loss function and a regularizer to recover the desired group sparsity patterns, which can embrace many existing works. We analyze directional stationary solutions of the proposed formulation, obtaining a sufficient condition for a directional stationary solution to achieve optimality and establishing a bound on the distance from the solution to a reference point. We develop an efficient algorithm based on the alternating direction method of multipliers (ADMM), showing that the iterates converge to a directional stationary solution under certain conditions. In numerical experiments, we implement the algorithm for generalized linear models with convex and nonconvex group regularizers to evaluate the model performance on various data types, noise levels, and sparsity settings.
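In ADMM schemes for non-overlapping group sparsity, the regularizer subproblem typically reduces to a group-wise proximal step. The sketch below shows that step for the convex group lasso penalty only (group soft-thresholding). It is an illustrative special case, not the paper's generalized formulation or its nonconvex regularizers, and the function name and arguments are hypothetical.

```python
import math

def group_soft_threshold(v, groups, threshold):
    """Proximal operator of threshold * sum_g ||v_g||_2 for non-overlapping groups.

    v         : list of floats
    groups    : list of index lists that partition range(len(v))
    threshold : non-negative shrinkage amount (penalty weight times step size)
    """
    out = list(v)
    for idx in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in idx))
        # Shrink the whole group toward zero; a group whose norm is below
        # the threshold is set to zero entirely, which yields group sparsity.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * v[i]
    return out

# The first group (norm 5) is shrunk; the second (norm about 0.14) is zeroed.
print(group_soft_threshold([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], 1.0))
```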
Bayesian hidden mark interaction model for detecting spatially variable genes in imaging-based spatially resolved transcriptomics data

https://doi.org/10.3389/fgene.2024.1356709

Yang, Jie; Jiang, Xi; Jin, Kevin Wang; Shin, Sunyoung; Li, Qiwei (April 2024, Frontiers in Genetics)

Recent technology breakthroughs in spatially resolved transcriptomics (SRT) have enabled the comprehensive molecular characterization of cells whilst preserving their spatial and gene expression contexts. One of the fundamental questions in analyzing SRT data is the identification of spatially variable genes whose expressions display spatially correlated patterns. Existing approaches are built upon either the Gaussian process-based model, which relies on ad hoc kernels, or the energy-based Ising model, which requires gene expression to be measured on a lattice grid. To overcome these potential limitations, we developed a generalized energy-based framework to model gene expression measured from imaging-based SRT platforms, accommodating the irregular spatial distribution of measured cells. Our Bayesian model applies a zero-inflated negative binomial mixture model to dichotomize the raw count data, reducing noise. Additionally, we incorporate a geostatistical mark interaction model with a generalized energy function, where the interaction parameter is used to identify the spatial pattern. Auxiliary variable MCMC algorithms were employed to sample from the posterior distribution with an intractable normalizing constant. We demonstrated the strength of our method on both simulated and real data. Our simulation study showed that our method captured various spatial patterns with high accuracy; moreover, analysis of a seqFISH dataset and a STARmap dataset established that our proposed method is able to identify genes with novel and strong spatial patterns.
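For readers unfamiliar with the zero-inflated negative binomial (ZINB) distribution mentioned in this abstract, the sketch below evaluates its probability mass function. The mean/dispersion parameterization and all names are assumptions made for illustration; the abstract does not specify the authors' parameterization, and this does not reproduce their mixture model or dichotomization step.

```python
import math

def zinb_pmf(k, mu, phi, pi_zero):
    """Probability of count k under a zero-inflated negative binomial.

    mu      : mean of the negative binomial component (> 0)
    phi     : dispersion (size) parameter (> 0)
    pi_zero : probability of a structural (excess) zero, in [0, 1]
    """
    # Negative binomial log-pmf in the mean/dispersion parameterization.
    log_nb = (
        math.lgamma(k + phi) - math.lgamma(k + 1) - math.lgamma(phi)
        + phi * math.log(phi / (phi + mu))
        + k * math.log(mu / (phi + mu))
    )
    nb = math.exp(log_nb)
    # Zeros come from either the structural-zero component or the NB component.
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb

# Zero inflation raises the probability of observing a zero count.
print(zinb_pmf(0, mu=2.0, phi=1.5, pi_zero=0.3), zinb_pmf(0, mu=2.0, phi=1.5, pi_zero=0.0))
```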
Full Text Available
Iteratively Reweighted Group Lasso Based on Log-Composite Regularization

https://doi.org/10.1137/20M1349072

Ke, Chengyu; Ahn, Miju; Shin, Sunyoung; Lou, Yifei (January 2021, SIAM Journal on Scientific Computing)

Full Text Available
Ensemble estimation and variable selection with semiparametric regression models

https://doi.org/10.1093/biomet/asaa012

Shin, Sunyoung; Liu, Yufeng; Cole, Stephen R; Fine, Jason P (April 2020, Biometrika)

Summary We consider scenarios in which the likelihood function for a semiparametric regression model factors into separate components, with an efficient estimator of the regression parameter available for each component. An optimal weighted combination of the component estimators, named an ensemble estimator, may be employed as an overall estimate of the regression parameter, and may be fully efficient under uncorrelatedness conditions. This approach is useful when the full likelihood function may be difficult to maximize, but the components are easy to maximize. It covers settings where the nuisance parameter may be estimated at different rates in the component likelihoods. As a motivating example we consider proportional hazards regression with prospective doubly censored data, in which the likelihood factors into a current status data likelihood and a left-truncated right-censored data likelihood. Variable selection is important in such regression modelling, but the applicability of existing techniques is unclear in the ensemble approach. We propose ensemble variable selection using the least squares approximation technique on the unpenalized ensemble estimator, followed by ensemble re-estimation under the selected model. The resulting estimator has the oracle property such that the set of nonzero parameters is successfully recovered and the semiparametric efficiency bound is achieved for this parameter set. Simulations show that the proposed method performs well relative to alternative approaches. Analysis of an AIDS cohort study illustrates the practical utility of the method.
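The idea of an optimal weighted combination of component estimators can be illustrated in the simplest setting: a scalar parameter with uncorrelated, unbiased component estimators, where the variance-minimizing weights are proportional to the inverse variances. The sketch below covers only that textbook case, under those stated assumptions; it is not the paper's semiparametric ensemble estimator or its variable selection procedure, and the function name is hypothetical.

```python
def inverse_variance_ensemble(estimates, variances):
    """Combine uncorrelated scalar estimators with inverse-variance weights.

    Returns the combined estimate and its variance, 1 / sum(1 / var_i).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("estimates and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    # Each weight is the component's share of the total precision.
    combined = sum(p * e for p, e in zip(precisions, estimates)) / total_precision
    return combined, 1.0 / total_precision

# The more precise component (variance 1.0) receives the larger weight.
print(inverse_variance_ensemble([1.0, 3.0], [1.0, 3.0]))
```

The combined variance is never larger than the smallest component variance, which is the sense in which combining components can improve efficiency in this simple case.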
Full Text Available
